ABSTRACT

Although many source code plagiarism detection approaches exist, most address only lexical similarity attacks, on the assumption that plagiarism is committed only by students who are not proficient in programming. However, students often plagiarize not only because they lack programming ability but also because of poor time management, and such students are capable of disguising copied code beyond the lexical level. Semantic similarity attacks should therefore also be detected and evaluated. This research proposes a semantic similarity detection approach for source code that can detect most source code similarities by representing the source code as an Abstract Syntax Tree (AST) and evaluating similarity with a Siamese neural network. Because an AST is a language-dependent feature, the evaluation uses the AI-SOCO dataset, which consists of C++ programs. The proposed approach was implemented, and an experimental study on the 100,000-sample AI-SOCO dataset showed that the proposed similarity measure improved precision, recall, and F1 score by 15%, 10%, and 22%, respectively. Based on this evaluation, we conclude that our approach is more effective than most existing systems for detecting source code plagiarism. As future work, the system could be extended to detect source code similarity across programming languages.

Keywords: Source code, Lexical plagiarism, Semantic neural network